Explainability - by Example
This section shows, through practical examples, how explainability can be implemented for both traditional and generative AI models.
Linear Regression Model
Problem Statement: Inaccurate and inconsistent valuation of residential real estate properties results in significant financial risks for investors, buyers, and sellers. Current valuation methods rely heavily on expert judgment and limited data, leading to potential overvaluation or undervaluation.
Solution: Develop a Linear Regression model to accurately predict future market values of residential real estate properties based on a comprehensive dataset of property and market characteristics.
Training Data Details
Feature | Description
---|---
MedInc | Median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k$]
HouseAge | Median age of a house within a block; a lower number is a newer building [years]
AveRooms | Total number of rooms within a block
AveBedrms | Total number of bedrooms within a block
Population | Total number of people residing within a block
AveOccup | Average number of household members
Latitude | A measure of how far north a house is; a higher value is farther north [°]
Longitude | A measure of how far west a house is; a higher value is farther west [°]
MedHouseVal | Median house value for households within a block (measured in US Dollars) [$]
Global explainability in ML models refers to the ability to understand and interpret the overall behavior and decision-making process of a machine learning model across all its predictions, rather than just individual instances. It provides insights into how different features contribute to the model's predictions on average. Following are a few model-agnostic techniques that identify important features influencing model predictions.
Kernel SHAP
Kernel SHAP analysis indicates that income and occupancy are the top features affecting the model's predictions. This means these factors play a significant role in explaining how the model arrives at its conclusions. In simpler terms, changes in income and occupancy levels have the largest impact on predicting the future market value of residential real estate properties.
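A minimal sketch of how such a Kernel SHAP analysis can be produced, assuming the scikit-learn California housing dataset (whose columns match the feature table above) and a freshly trained linear regression model; the sample sizes and background summary are illustrative choices, not necessarily the exact setup behind the chart above:

```python
# Global Kernel SHAP for a house-value regressor (illustrative sketch).
import shap
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Kernel SHAP is model-agnostic: it only needs a prediction function and a
# background sample that represents "typical" inputs.
background = shap.kmeans(X_train, 25)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test.iloc[:100])

# Global importance = mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(data.feature_names, importance), key=lambda t: -t[1]):
    print(f"{name:12s} {score:.3f}")
```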
Local explainability in ML models refers to the ability to understand and interpret the decision-making process of a model for specific instances or predictions. Unlike global explainability, which focuses on the overall behavior of the model, local explainability provides insights into how different features contribute to a model's prediction for a particular input. This helps users understand why the model made a specific decision and allows for greater trust and transparency in the model's outputs.
LIME (Local Interpretable Model-agnostic Explanations)
LIME analysis shows that income, location, and occupancy are the most influential features for our model. This indicates that changes in these factors significantly impact the predictions for residential real estate market values. In simpler terms, variations in income, where a property is located, and how many people live in it play a crucial role in determining its future value.
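A minimal LIME sketch for a single prediction, reusing the `model`, `data`, and train/test splits from the Kernel SHAP sketch above; the `lime` package and the number of features shown are assumptions:

```python
# Local LIME explanation for one property (illustrative sketch).
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=data.feature_names,
    mode="regression",
)

# Explain one prediction: which feature conditions pushed this estimate up or down?
instance = X_test.iloc[0].values
explanation = explainer.explain_instance(instance, model.predict, num_features=5)
for feature_rule, weight in explanation.as_list():
    print(f"{feature_rule:30s} {weight:+.3f}")
```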
Other Explainer methods
SHAP (SHapley Additive exPlanations)
SHAP analysis reveals that income, location, and occupancy have the highest feature importances. This suggests these features significantly contribute to explaining the model's predictions. In simpler terms, variations in these features have the greatest influence on predicting the future market value of residential real estate properties.

Figure: Feature Importance
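A minimal sketch of producing such a SHAP feature-importance chart with shap's generic `Explainer`, again reusing the fitted `model` and splits from the earlier sketches:

```python
# SHAP feature-importance bar chart (illustrative sketch).
import shap

explainer = shap.Explainer(model, X_train)   # dispatches to an exact linear explainer here
shap_values = explainer(X_test)

# Bar plot of mean |SHAP value| per feature, i.e. the global importance chart.
shap.plots.bar(shap_values)
```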
Permutation Importance
By shuffling the order of income, location, and occupancy, permutation importance assessed the impact on model performance. High importance for these features indicates that shuffling their values significantly disrupts the model's predictions. This suggests they're crucial for the model to make accurate decisions.

Figure: Variable Importance
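A minimal sketch of permutation importance with scikit-learn, reusing the fitted `model` and held-out split from the sketches above; the number of repeats and scoring metric are illustrative choices:

```python
# Permutation importance on the held-out set (illustrative sketch).
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42, scoring="r2"
)

# A large drop in R² when a column is shuffled means the model relies on it.
for name, mean, std in sorted(
    zip(data.feature_names, result.importances_mean, result.importances_std),
    key=lambda t: -t[1],
):
    print(f"{name:12s} {mean:.3f} +/- {std:.3f}")
```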
Partial Dependence Variance
Decomposing the model's predictions by income, location, house age and occupancy reveals high partial dependence variance for these features. This signifies that the average prediction of the model exhibits substantial variation with changes in their values. In simpler terms, income, location, and occupancy independently exert a strong influence on the model's output.
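A minimal sketch of the partial-dependence-variance idea using scikit-learn's `partial_dependence`, with the spread of each feature's partial dependence curve as its importance score; this is a simplified stand-in for a dedicated PD-variance explainer, reusing the fitted `model` and `X_test` from the earlier sketches:

```python
# Partial-dependence-variance importance (illustrative sketch).
from sklearn.inspection import partial_dependence

for i, name in enumerate(data.feature_names):
    pd_result = partial_dependence(model, X_test, features=[i], kind="average")
    curve = pd_result["average"][0]
    # A large spread in the PD curve => the average prediction moves a lot as
    # this feature changes => high importance.
    print(f"{name:12s} PD std = {curve.std():.3f}")
```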
Binary Classifier Model
Problem Statement: The problem at hand is to accurately predict whether an individual is at risk of developing heart disease. This prediction can be made by analyzing various health-related factors, demographic information, and medical history.
Solution: Develop a Binary Classification model to accurately predict individuals who are at risk of heart disease so that preventive measures can be implemented to improve their health outcomes.
Training Data Details
Feature | Description
---|---
age | Age of the individual in years.
sex | Gender of the individual (1 = male, 0 = female).
chest_pain_type | Type of chest pain experienced (0-3 categorical values).
resting_blood_pressure | Resting blood pressure (in mm Hg) measured when the individual is at rest.
serum_cholesterol | Serum cholesterol level (in mg/dl).
fasting_blood_sugar | Fasting blood sugar level (> 120 mg/dl = 1, otherwise = 0).
resting_ecg_results | Resting electrocardiographic results (0-2 categorical values).
max_heart_rate_achieved | Maximum heart rate achieved during exercise (in bpm).
exercise_induced_angina | Exercise-induced angina (1 = yes, 0 = no).
oldpeak | ST depression induced by exercise relative to rest.
slope | Slope of the peak exercise ST segment (0-2 categorical values).
number_of_vessels_fluro | Number of major vessels (0-3) colored by fluoroscopy.
thalassemia | Thalassemia status (1 = normal, 2 = fixed defect, 3 = reversible defect).
is_disease | Denotes whether the individual has heart disease (1 = yes, 0 = no).
Global explainability in ML models refers to the ability to understand and interpret the overall behavior and decision-making process of a machine learning model across all its predictions, rather than just individual instances. It provides insights into how different features contribute to the model's predictions on average. Following are a few model-agnostic techniques that identify important features influencing model predictions.
Kernel SHAP
Global Kernel SHAP analysis reveals that sex, chest pain type, and the number of vessels colored by fluoroscopy are the most influential features across all predictions in our heart disease prediction model. This indicates that these factors consistently affect the risk assessment for heart disease across the entire dataset. In simpler terms, the overall patterns suggest that a person's gender, the type of chest pain they experience, and the number of major vessels affected are key indicators in evaluating heart disease risk.
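A minimal sketch of Kernel SHAP for a binary classifier; the file name `heart.csv`, the random-forest model, and the sample sizes are assumptions, while the column names follow the table above. For classifiers, the probability of the positive class is the quantity being explained:

```python
# Global Kernel SHAP for a heart-disease classifier (illustrative sketch).
import shap
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                      # hypothetical path
X = df.drop(columns=["is_disease"])
y = df["is_disease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

def predict_pos(data):
    # Explain the predicted probability of the positive (disease) class.
    return clf.predict_proba(data)[:, 1]

explainer = shap.KernelExplainer(predict_pos, shap.kmeans(X_train, 25))
shap_values = explainer.shap_values(X_test.iloc[:50], nsamples=200)

importance = np.abs(shap_values).mean(axis=0)
print(pd.Series(importance, index=X.columns).sort_values(ascending=False))
```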
Local explainability in ML models refers to the ability to understand and interpret the decision-making process of a model for specific instances or predictions. Unlike global explainability, which focuses on the overall behavior of the model, local explainability provides insights into how different features contribute to a model's prediction for a particular input. This helps users understand why the model made a specific decision and allows for greater trust and transparency in the model's outputs.
LIME (Local Interpretable Model-agnostic Explanations)
LIME analysis reveals that sex, chest pain type, and the number of vessels colored by fluoroscopy are the most influential features for our heart disease prediction model. This suggests that changes in these factors significantly impact the model's predictions regarding an individual's risk of developing heart disease. In simpler terms, variations in a person's gender, the type of chest pain they experience, and how many major vessels are affected are critical in assessing their likelihood of having heart disease. This highlights the importance of these features in understanding individual risk profiles.
Other Explainer methods
Problem Statement: The problem at hand is to accurately predict whether an employee is likely to leave an organization. This prediction can be made by analyzing various factors related to the employee's demographics, job satisfaction, and work environment.
Solution: Develop a Binary Classification model to accurately predict employees who are at risk of attrition so that proactive measures can be taken to retain them.
Training Data Details
Feature | Description
---|---
num_production_digital_project_changes_last_12_months | Number of changes made to a production-level digital project within the last 12 months.
pct_time_non_revenue_last_12_months | Percentage of time an employee spent on non-revenue-generating activities (e.g., administrative tasks, meetings) in the past 12 months.
emp_experience_diff_average_team_leadership_experience_last_9_months | Difference between an employee's experience and the average leadership experience of their team in the past 9 months.
num_promotions_in_past_2_years | Number of promotions an employee has received in the past 2 years.
emp_experience_diff_average_team_experience_3_vs_9_months | Difference between an employee's experience and the average team experience at two different time points: 3 and 9 months ago.
num_production_project_changes_last_6_months | Number of changes or updates made to production projects in the last 6 months.
num_production_project_changes_last_9_months | Number of changes or updates made to production projects in the last 9 months.
education_level_bachelors | Indicates whether an employee has a bachelor's degree.
education_level_masters | Indicates whether an employee has a master's degree.
is_attrited | Denotes whether the employee has resigned or not.
SHAP (SHapley Additive exPlanations)
SHAP analysis assigns high importance to bench time, average cohort rating, education level, and number of projects changed. This indicates that these features make the largest marginal contributions to the model's predictions. In other words, these features receive the most credit when the model explains how it arrived at a specific prediction.
Partial Dependence Variance
Partial dependence variance analysis identifies high importance for bench time, average cohort rating, education level, and number of projects changed. This means changes in these features lead to significant variations in the average prediction of the model. In simpler terms, these features independently have a strong effect on where the model's output lands on average.
Anchor
Anchor analysis identifies a rule of feature conditions (an "anchor") that locks in the model's prediction for a specific data point. Here, the anchor is characterized by bench time exceeding 44.44 in the last 9 months, together with a non-positive change in average team leadership rating between month 3 and month 2.
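A minimal Anchor sketch using the `alibi` library; `clf`, `X_train` (a NumPy feature matrix), and `feature_names` are assumed to be a fitted attrition classifier, its training data, and the column names from the table above:

```python
# Anchor explanation for one employee record (illustrative sketch).
from alibi.explainers import AnchorTabular

explainer = AnchorTabular(predictor=clf.predict, feature_names=feature_names)
explainer.fit(X_train, disc_perc=(25, 50, 75))   # discretize numeric features

# Explain a single row: the anchor is an IF-rule that "locks in" the prediction.
explanation = explainer.explain(X_train[0], threshold=0.95)
print("Anchor   :", " AND ".join(explanation.anchor))
print("Precision:", explanation.precision)
print("Coverage :", explanation.coverage)
```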
Time Series Forecasting Model
Problem Statement: Insufficient inventory can result in lost sales due to unavailability of products. Overstocking can tie up capital and increase holding costs.
Solution: Develop a time series forecasting model that accurately predicts weekly sales. Accurate sales forecasting is essential for effective business operations, financial stability, and customer satisfaction; failure to forecast sales can lead to inventory issues, financial difficulties, operational inefficiencies, missed market opportunities, and reduced customer satisfaction.
Training Data Details
Feature | Description
---|---
Store | Store number
Dept | Department number
IsHoliday | Whether the week is a special holiday week
Type | Store type: stores are classified into three types (A, B, and C) according to their size; almost half of the stores are bigger than 150,000 and categorized as type A
Size | Store size
Temperature | Average temperature in the region
Fuel_Price | Cost of fuel in the region
MarkDown1 | Anonymized data related to promotional markdowns that Walmart is running
MarkDown2 | Anonymized data related to promotional markdowns that Walmart is running
MarkDown3 | Anonymized data related to promotional markdowns that Walmart is running
MarkDown4 | Anonymized data related to promotional markdowns that Walmart is running
MarkDown5 | Anonymized data related to promotional markdowns that Walmart is running
CPI | The consumer price index
Unemployment | The unemployment rate
Day | Day
Week | Week
Month | Month
Quarter | Quarter
Year | Year
Weekly_Sales | Sales for the given department in the given store
Explanation:
LIME: The LIME analysis identifies critical factors such as store size, holiday weeks, regional fuel prices, month, and quarter as pivotal for the model's decision-making process when forecasting weekly sales from the historical time series data.
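A minimal sketch of running LIME over the tabular features of the weekly-sales forecaster; `forecaster` and `X_train` are assumptions standing in for the fitted forecasting model and its feature matrix with the columns listed in the table above:

```python
# LIME on the tabular features of a sales forecaster (illustrative sketch).
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=list(X_train.columns),
    mode="regression",
)
exp = explainer.explain_instance(X_train.values[0], forecaster.predict, num_features=6)
for rule, weight in exp.as_list():
    print(f"{rule:35s} {weight:+.2f}")
```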

Chain of Thought
Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.
Solution: Chain-of-thought reasoning mirrors human reasoning. It facilitates systematic problem-solving by breaking down complex tasks into a coherent series of logical deductions.
Explanation:
Prompt: What is the largest river in India?
Chain of thought gives the detailed step-by-step reasoning behind the LLM's response to the above prompt.
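A minimal sketch of chain-of-thought prompting through an OpenAI-compatible chat API; the client setup, model name, and prompt wording are illustrative assumptions rather than the exact implementation used here:

```python
# Chain-of-thought prompting (illustrative sketch).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is configured in the environment

cot_prompt = (
    "What is the largest river in India?\n"
    "Let's think step by step, then state the final answer on its own line."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",                          # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)        # step-by-step reasoning + answer
```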

Thread of Thought
Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.
Solution: Thread-of-thought reasoning mirrors human reasoning, especially over long or chaotic contexts. It facilitates systematic problem-solving by walking through the context in manageable parts, summarizing and analyzing as it goes.
Explanation:
Prompt: What is the largest river in India?
Thread of thought gives the detailed step-by-step reasoning behind the LLM's response to the above prompt.
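A minimal thread-of-thought sketch, reusing the `client` from the chain-of-thought sketch above; the trigger sentence follows the ThoT template, and the context placeholder is illustrative:

```python
# Thread-of-thought prompting (illustrative sketch).
context = "..."   # illustrative placeholder for retrieved passages about Indian rivers
question = "What is the largest river in India?"

thot_prompt = (
    f"{context}\n\nQ: {question}\n"
    "Walk me through this context in manageable parts step by step, "
    "summarizing and analyzing as we go."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": thot_prompt}],
)
print(response.choices[0].message.content)
```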

ReRead Reasoning
Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.
Solution: ReRead (RE2) reasoning mirrors human reasoning. Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), which aim to elicit the reasoning process in the output, RE2 shifts the focus to the input by processing the question twice, thereby enhancing understanding. Consequently, RE2 demonstrates strong generality and compatibility with most thought-eliciting prompting methods.
Explanation:
Prompt: What is the largest river in India?
ReRead reasoning gives the detailed step-by-step reasoning behind the LLM's response to the above prompt.
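A minimal RE2 sketch, reusing the `client` from the earlier sketch: the question is simply presented twice before the reasoning trigger:

```python
# RE2 (re-reading) prompting (illustrative sketch).
question = "What is the largest river in India?"
re2_prompt = (
    f"Q: {question}\n"
    f"Read the question again: {question}\n"
    "Let's think step by step."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": re2_prompt}],
)
print(response.choices[0].message.content)
```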

Graph of Thought
Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.
Solution: Graph-of-thought reasoning mirrors human reasoning. It facilitates systematic problem-solving by organizing intermediate thoughts as a graph, so reasoning steps can be combined and revisited rather than following a single linear chain of deductions.
Explanation:
Prompt: What is the largest river in India?
Graph of thought gives the detailed step-by-step reasoning behind the LLM's response to the above prompt.

Chain of verification
Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.
Solution: Chain of verification helps to understand the LLM response by checking the base answer against a series of verification questions and answers.
Verification:
Prompt: What is the largest river in India?
Chain of verification asks the LLM five different verification questions grounded in the context of the original reasoning and, based on the five answers it receives, derives the final answer.
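A minimal chain-of-verification sketch, reusing the `client` from the earlier sketches via a small `ask` helper; the prompt wording and the fixed count of five verification questions mirror the description above but are otherwise illustrative:

```python
# Chain-of-verification (illustrative sketch).
def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "What is the largest river in India?"
baseline = ask(question)

# 1) Plan verification questions about the baseline answer.
plan = ask(f"Question: {question}\nAnswer: {baseline}\n"
           "Write 5 verification questions that would check this answer, one per line.")

# 2) Answer each verification question independently (baseline not in context).
verification_answers = [ask(q) for q in plan.splitlines() if q.strip()]

# 3) Produce the final, verified answer from the evidence gathered.
final = ask(f"Question: {question}\nDraft answer: {baseline}\n"
            f"Verification questions:\n{plan}\nVerification answers:\n"
            + "\n".join(verification_answers)
            + "\nGive the final, corrected answer.")
print(final)
```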

Token Importance
Problem Statement: In AI language models, the importance of tokens can significantly influence the generated responses. Understanding which tokens (words or phrases) are most impactful can be crucial for interpreting and trusting the model's decisions. This is particularly relevant in applications where precise and reliable outputs are essential, such as in healthcare, finance, and legal domains.
Solution: Token Importance helps in understanding how different tokens contribute to the AI model's responses. By analyzing the relative importance of tokens, users can gain insight into which parts of the input significantly affect the model’s output. This can enhance transparency and trust in the AI system's decision-making process.
Explanation:
Prompt: What is the largest river in India?
Displays a matrix of the top 10 tokens and their importance.
The Token Importance Distribution Chart illustrates the significance of individual tokens by displaying the distribution of their associated impact scores. The chart's shape reveals the following insights:
Flat Distribution: Tokens have similar importance, with no clear standout
Left-Peaked Distribution: Tokens have low impact scores, indicating lesser importance
Right-Peaked Distribution: Tokens have high impact scores, signifying greater importance
Displays the importance of each token (the top 10 tokens by importance are shown in this chart); a minimal sketch of estimating token importance follows below.
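A minimal leave-one-token-out sketch of token importance, reusing the `ask` helper from the chain-of-verification sketch; scoring a token by whether its removal changes the answer is a deliberately crude illustration of the idea, not the exact method behind the chart:

```python
# Leave-one-token-out token importance (illustrative sketch).
prompt = "What is the largest river in India?"
tokens = prompt.split()
baseline = ask(prompt)

scores = {}
for i, tok in enumerate(tokens):
    perturbed = " ".join(tokens[:i] + tokens[i + 1:])
    changed = ask(perturbed) != baseline
    scores[tok] = 1.0 if changed else 0.0   # 1 = removing the token changes the answer

# Top tokens by importance (ties broken arbitrarily).
for tok, score in sorted(scores.items(), key=lambda t: -t[1])[:10]:
    print(f"{tok:10s} {score:.1f}")
```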

Search Augmentation
Problem Statement: When users cannot discern how AI systems reach their conclusions, it can undermine trust in the technology. This issue is particularly pressing in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences. Ensuring the accuracy and transparency of AI responses is crucial for maintaining user confidence.
Solution: The Chain of Verification through Internet Search offers a systematic approach to validate the accuracy of an AI response by cross-referencing it with multiple reliable online sources. This method involves querying different authoritative sources to confirm the correctness of the AI's answer and provide clarity on how the response was derived.
Explanation:
Prompt: What is the largest river in India?
Internet search displays the final response by cross-validating the facts generated by the thread-of-thought reasoning against internet search results.
List of facts used by the thread-of-thought reasoning while generating the LLM response
Explanation based on internet search results
Judgement on the accuracy of the LLM response. The validation process compares the LLM's thread of thought against internet search results.
No: Internet search results contradict the LLM's facts, indicating potential inaccuracies.
Yes: Internet search results support the LLM's facts, confirming their validity.
Unclear: Internet search results lack sufficient information to determine the accuracy of the LLM's response, requiring further investigation.

Logic of Thought
Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.
Solution: The logic-of-thought technique can encompass a variety of methodologies for understanding and analyzing human thinking or reasoning. In essence, it involves formalized techniques for logical reasoning (such as deductive reasoning and critical thinking). The purpose of these techniques is to enhance clarity of thought, ensure decisions are made based on sound logic, and improve the cognitive processes we use to navigate the world.
Explanation:
Prompt: Which is the largest river in India?
The Logic of Thought (LoT) extracts propositions and logical expressions, extending them to generate expanded logical information from the input context. This generated logical information is then utilized as an additional augmentation to the input prompts, thereby enhancing the system's logical reasoning capabilities.
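A minimal sketch of the Logic-of-Thought flow described above, reusing the `ask` helper from the earlier sketches; the three-stage prompts (extract, expand, answer) are illustrative:

```python
# Logic-of-Thought prompt augmentation (illustrative sketch).
question = "Which is the largest river in India?"

# 1) Extract propositions and logical relations from the input context.
propositions = ask(
    f"Extract the key propositions and logical relations needed to answer:\n{question}"
)

# 2) Expand them by applying basic logical laws to derive further implications.
expanded = ask(
    "Apply basic logical laws to derive any additional implications from these "
    f"propositions:\n{propositions}"
)

# 3) Augment the original prompt with the expanded logical information.
answer = ask(
    f"{question}\n\nUse the following logical information while reasoning:\n"
    f"{propositions}\n{expanded}"
)
print(answer)
```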

Evaluation Metrics
Problem Statement: Defining "correct" or "good" explanations can be ambiguous and context-dependent. There are no standardized and universally accepted metrics to objectively quantify the quality of explanations generated by LLMs, their overall performance, and the user's satisfaction with the interaction.
Solution: Recent research on LLM explanation, QUEST (Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence), proposes explaining LLM responses with metrics such as uncertainty, relevance, coherence, language tone, and sentiment analysis. With the help of prompt engineering, we ask the LLM to evaluate scores for these metrics and provide an explanation for each.
Metrics:
Prompt: What is the largest river in India?
Uncertainty quantification and the coherence score are the two evaluation metrics we have implemented to quantify the quality of the explanations generated by the LLM. A high coherence score indicates how logically the given answer aligns with the actual query, while a low uncertainty score denotes that the LLM has high confidence in its answer. A minimal scoring sketch follows the scales below.
Coherence
Less Coherent: >=0 and <=30
Moderately Coherent: >30 and <=70
Highly Coherent: >70 and <=100
Certainty
Highly Certain: >=0 and <=30 (less uncertainty)
Moderately Certain: >30 and <=70 (moderately uncertain)
Less Certain: >70 and <=100 (highly uncertain)
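A minimal sketch of prompt-based scoring for the two metrics above, reusing the `ask` helper from the earlier sketches; the rubric wording and the JSON response shape are assumptions:

```python
# Prompt-based coherence and uncertainty scoring (illustrative sketch).
import json

question = "What is the largest river in India?"
answer = ask(question)

judge_prompt = (
    f"Question: {question}\nAnswer: {answer}\n\n"
    "Rate the answer on two 0-100 scales and reply with JSON only, using keys "
    "'coherence' (how logically the answer aligns with the question), "
    "'uncertainty' (how uncertain the answer sounds), and a short 'explanation'."
)
scores = json.loads(ask(judge_prompt))   # assumes the model returns plain JSON
print(scores["coherence"], scores["uncertainty"], scores["explanation"])
```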

Note: Some examples, datasets, and graphs used in this section are sourced from publicly available information and are attributed to their respective creators.